The Acquisition of Lexical Semantic Knowledge from Large Corpora
Abstract
Machine-readable dictionaries provide the raw material from which to construct computationally useful representations of the generic vocabulary contained within them. Many sublanguages, however, are poorly represented in on-line dictionaries, if represented at all. Vocabularies geared to specialized domains are necessary for many applications, such as text categorization and information retrieval. In this paper I describe research devoted to developing techniques for building sublanguage lexicons via syntactic and statistical corpus analysis coupled with analytic techniques based on the tenets of a generative lexicon.

1. Introduction

Machine-readable dictionaries provide the raw material from which to construct computationally useful representations of the generic vocabulary contained within them. Many sublanguages, however, are poorly represented in on-line dictionaries, if represented at all (cf. Grishman et al. (1986)). Yet vocabularies geared to specialized domains are necessary for many applications, such as text categorization and information retrieval. In this paper I describe research devoted to developing techniques for building sublanguage lexicons via syntactic and statistical corpus analysis coupled with analytic techniques based on the tenets of a generative theory of the lexicon
Similar resources
An Application of Lexical Semantics to Knowledge Acquisition from Corpora
In this paper, we describe a program of research designed to explore how a lexical semantic theory may be exploited for extracting information from corpora suitable for use in Information Retrieval applications. Unlike with purely statistical collocational analyses, the framework of a semantic theory allows the automatic construction of predictions about semantic relationships among words ap...
Combining NLP and statistical techniques for lexical acquisition
The growing availability of large on-line corpora encourages the study of word behaviour directly from accessible raw texts. However, the methods by which lexical knowledge should be extracted from plain texts are still a matter of debate and experimentation. This paper presents an integrated tool for lexical acquisition from corpora, ARIOSTO, based on a hybrid methodology that combines ...
Lexical Database for Multiple Languages: Multilingual Word Semantic Network
Data mining and knowledge engineering have become difficult tasks owing to the large amount of data now available on the web. The validity and reliability of data have also become a major point of debate in knowledge acquisition. In addition, acquiring knowledge from different languages has become another concern. There are many language translators and corpora developed but the function of these translators an...
Corpus-Based Induction of Lexical Representation and Meaning
The acquisition of linguistic knowledge, i.e., the identification, extraction, and encoding of linguistic information in a corpus, has been one of the main motivations for data-driven approaches to natural language. Methods have been developed for the acquisition of, for instance, parts of speech, noun compounds, collocations, support verbs, subcategorization frames, phrase structure rules, selec...
Lexical acquisition from corpora: the case of subcategorization frames in French
We present in this paper a method to automatically acquire a syntactic lexicon of subcategorization frames for French verbs directly from large corpora. The method is evaluated against existing lexical resources: we show that our system is capable of producing new frames that were not previously registered. Lastly, we show that it is possible to induce lexico-semantic classes « à la Levin » (19...
Using Web Corpora for the Automatic Acquisition of Lexical-Semantic Knowledge
This article presents two case studies to explore whether and how web corpora can be used to automatically acquire lexical-semantic knowledge from distributional information. For this purpose, we compare three German web corpora and a traditional newspaper corpus on modelling two types of semantic relatedness: (1) Assuming that free word associations are semantically related to their stimuli, w...